This is a Quarto report template.
Slides: slides.html (go to slides.qmd to edit)
Remember: Your goal is to make your audience understand and care about your findings. By crafting a compelling story, you can effectively communicate the value of your data science project.
Carefully read this template since it has instructions and tips on writing!
Currently a work in progress
For this project we aim to compare the predictive power of a traditional linear model against a graph neural network (GNN) on weather station data.
The data comprises weather station nodes located within the same geographic region. For the linear model, two variants will be produced: one trained on all of the data, treating each station as a node, and another trained on each station individually.
The GNN will be trained on the entire graph of the weather station data, using a dense adjacency matrix with no self-connections (see the sketch below).
Training will focus on producing forecasts, with performance compared over a six-month period at the tail end of the data.
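As a rough illustration of that adjacency construction, here is a minimal sketch with made-up distances, not the matrix built later in this report:

```python
import numpy as np

# Hypothetical pairwise distances (km) between four stations; in the actual
# pipeline these come from geodesic distances between station coordinates.
dist = np.array([[  0.,  50.,  80., 120.],
                 [ 50.,   0.,  60.,  90.],
                 [ 80.,  60.,   0.,  40.],
                 [120.,  90.,  40.,   0.]])

# Dense adjacency: inverse-distance weight between every pair of stations,
# with the diagonal left at zero so there are no self-connections.
adj = np.zeros_like(dist)
off_diag = dist > 0
adj[off_diag] = 1.0 / dist[off_diag]
```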
This section will be expanded as the cleaning process is further refined
Detail the models or algorithms used.
Justify your choices based on the problem and data.
The common non-parametric regression model is \(Y_i = m(X_i) + \varepsilon_i\), where the response \(Y_i\) is the value of an unknown regression function \(m\) at \(X_i\) plus an error term \(\varepsilon_i\). Under this model, \(m(x)\) can be estimated by local averaging: average the \(Y_i\) whose \(X_i\) lie near \(x\). In other words, we trace a curve through the data points with the help of the surrounding observations. The estimator is:
\[ \hat{m}_n(x) = \sum_{i=1}^{n} W_n(X_i)\, Y_i \tag{1} \]
Here the weights \(W_n(X_i)\) are non-negative and sum to one, and a weight is small when \(X_i\) is far from \(x\).
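To make the weighting concrete (an illustrative sketch, not part of the original analysis), a Gaussian-kernel local average implements exactly formula (1): each \(W_n(X_i)\) is large when \(X_i\) is near \(x\), and the weights are normalized to sum to one.

```python
import numpy as np

def local_average(x, X, Y, bandwidth=1.0):
    # Estimate m(x) as sum_i W_n(X_i) * Y_i with Gaussian kernel weights.
    w = np.exp(-0.5 * ((X - x) / bandwidth) ** 2)  # large when X_i is near x
    w = w / w.sum()                                # positive weights summing to one
    return float(np.dot(w, Y))

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200))
Y = np.sin(X) + rng.normal(0, 0.2, 200)
print(local_average(5.0, X, Y))  # local estimate of m(5); true value is sin(5)
```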
A second reference equation, the simple linear model:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
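One way to estimate \(\beta_0\) and \(\beta_1\) is ordinary least squares; a minimal numpy sketch with toy data (the coefficients below are made up for illustration):

```python
import numpy as np

# Toy data generated from y_i = 2 + 0.5 * x_i + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)

# Design matrix with an intercept column; lstsq returns [beta_0, beta_1].
design = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta)  # approximately [2.0, 0.5]
```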
Describe your data sources and collection process.
Present initial findings and insights through visualizations.
Highlight unexpected patterns or anomalies.
A study was conducted to determine how…
# %pip install polars==1.22.0
# %pip install numpy
# %pip install pandas
# %pip install seaborn
# %pip install matplotlib
# %pip install pyarrow
# %pip install datetime
# %pip install contextily
# %pip install h3==3.7.7
# %pip install shapely
# %pip install geopandas
# %pip install scikit-learn
# %pip install tqdm
# %pip install hypothesis==6.135.26
# Standard library imports (e.g., OS, datetime, etc.)
import os
import typing
import datetime
from pathlib import Path
import shutil
# Third-party libraries
import numpy as np
import polars as pl
import seaborn as sns
import pandas as pd
from polars.testing.parametric import columns
from sklearn.preprocessing import minmax_scale, robust_scale
from tqdm import tqdm
import geopy.distance
import scipy
import sklearn.preprocessing as pre
import geopandas
import shapely
from shapely.geometry import Polygon, Point, LineString
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import contextily as ctx
from h3 import h3
import datetime
import polars as pl
start_date = datetime.datetime(2010, 1, 1, 0, 0)
end_date = datetime.datetime(2020, 12, 31, 0, 0)
seed = 3435
pl.enable_string_cache()
data_path = r'kansas_asos_2010_2020.csv'
metar_schema = {'station': pl.Categorical,
'valid': pl.Datetime,
'lon': pl.Float64,
'lat': pl.Float64,
'elevation': pl.Float64,
'tmpf': pl.Float64,
'dwpf': pl.Float64,
'relh': pl.Float64,
'drct': pl.Float64,
'sknt': pl.Float64,
'p01i': pl.Float64,
'alti': pl.Float64,
'mslp': pl.Float64,
'vsby': pl.Float64,
'gust': pl.Float64,
'skyc1': pl.Categorical,
'skyc2': pl.Categorical,
'skyc3': pl.Categorical,
'skyc4': pl.Categorical,
'skyl1': pl.Float64,
'skyl2': pl.Float64,
'skyl3': pl.Float64,
'skyl4': pl.Float64,
'wxcodes': pl.String,
'ice_accretion_1hr': pl.Float64,
'ice_accretion_3hr': pl.Float64,
'ice_accretion_6hr': pl.Float64,
'peak_wind_gust': pl.Float64,
'peak_wind_drct': pl.Float64,
'peak_wind_time': pl.Datetime,
'feel': pl.Float64,
'metar': pl.String,
'snowdepth': pl.Float64}
asos_ldf = pl.scan_csv(data_path, null_values=['T', 'M', '///'], schema=metar_schema)\
.drop('metar')\
.with_columns(pl.col('valid').dt.round('1h').alias('valid'))
import numpy as np
full_date_series = np.arange(start_date, end_date, datetime.timedelta(hours=1))
asos_df = asos_ldf\
.collect()\
.select(pl.col('station', 'lat', 'lon', 'elevation'))\
.unique()\
.join(pl.DataFrame({'valid': full_date_series}), how='cross')\
.join(asos_ldf.collect(), on=['station', 'valid'], how='left')\
.with_columns(pl.col('valid').dt.round('6h').alias('valid'))\
.drop('lat_right', 'lon_right', 'elevation_right')\
.group_by(['station', 'valid'])\
.mean()\
.with_columns(pl.col(pl.Float64).cast(pl.Float32))
potential_features = asos_ldf.drop('valid', 'station', 'lat', 'lon', 'elevation').collect_schema().names()
feature_list = []
for feature in potential_features:
if not asos_df.select(pl.col(feature).is_null().all()).item():
feature_list.append(feature)
stations_list = asos_df\
.select(pl.col('station'))\
.unique()\
.to_series()\
.to_list()
from tqdm import tqdm
def safe_index(item, lst):
    # membership test: True if item appears in lst
    return item in lst
year_series = np.arange(start_date.year, end_date.year + 1, 1)
reduced_feature_df = pl.DataFrame(schema={**{'start_year': pl.Int64, 'end_year': pl.Int64, 'year_range': pl.Int64, 'year_label': pl.String, 'feature': pl.String},
**{station: pl.Boolean for station in stations_list}
})
for index, a in enumerate(tqdm(year_series)):
for b in year_series[index:]:
shifted_date_series = np.arange(datetime.date(a, 1, 1), datetime.date(b, 12, 31), datetime.timedelta(hours=6))
year_filter_df = asos_df.filter(pl.col('valid').dt.year().is_between(a, b))
for feature in feature_list:
if (year_filter_df.select(pl.col(feature).null_count()) != year_filter_df.height).item():
asos_pivot_df = year_filter_df\
.pivot(on='station', index='valid', values=feature, aggregate_function='mean')\
.drop('valid')
valid_stations = [s.name for s in asos_pivot_df if not (s.null_count() > len(shifted_date_series)*0.1)]
if len(valid_stations) >= 6:
if asos_pivot_df.var().select(pl.mean_horizontal(pl.all()).alias('mean')).item() > 5:
valid_stations.sort()
new_row = pl.DataFrame({**{'start_year': a,
'end_year': b,
'year_range': (b+1) - a,
'year_label': f'{a}-{b}',
'feature': feature},
**{station: safe_index(station, valid_stations) for station in stations_list}
})
reduced_feature_df = reduced_feature_df.vstack(new_row)
import matplotlib.pyplot as plt
import seaborn as sns
features = reduced_feature_df.select(pl.col('feature')).unique().to_series().to_list()
n_cols = 3
n_rows = -(-len(features) // n_cols) # Ceiling division
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3), constrained_layout=True,
sharex=True, sharey=True)
axes = axes.flatten()
for idx, feature in enumerate(features):
plot_df = (
reduced_feature_df
.filter(pl.col('feature') == feature)
.drop('feature', 'start_year', 'end_year', 'year_range')
.to_pandas()
)
plot_df.index = plot_df['year_label'].astype(str) # Ensure index is str, not int
plot_df = plot_df.drop('year_label', axis=1)
plot_df = plot_df.sort_index()
ax = axes[idx]
sns.heatmap(plot_df, cmap='magma', annot=True, cbar=False, ax=ax)
ax.set_title(feature)
# Shared axis labels
fig.suptitle("Station Participation by Feature and Year", fontsize=16)
fig.supxlabel("Stations")
fig.supylabel("Year")
plt.show()
df_t = reduced_feature_df\
.filter(pl.col('year_label').eq('2018-2020'))\
.drop('year_range', 'year_label', 'start_year', 'end_year')\
.transpose(include_header=True)
new_headers = df_t.row(0)
# Drop the first row
df_t_no_header = df_t.slice(1, df_t.height - 1)
# Assign new headers
df_t_renamed = df_t_no_header.rename({old: str(new) for old, new in zip(df_t_no_header.columns, new_headers)})
# note: after the transpose, the 'feature' column of df_t_renamed actually holds station names
valid_features = [col for col in df_t_renamed.columns if col != "feature"]
valid_stations = df_t_renamed\
.with_columns(pl.col(valid_features)\
.map_elements(lambda x: x == 'true', return_dtype=pl.Boolean))\
.filter( pl.all_horizontal([pl.col(col) for col in valid_features]))\
.select(pl.col('feature'))\
.to_series()\
.to_list()
reduced_asos_df = asos_df\
.filter(pl.col('valid').is_between(datetime.datetime(2018, 1, 1, 0, 0), datetime.datetime(2021, 1, 1, 0, 0)))\
.filter(pl.col('station').is_in(valid_stations))\
.select(pl.col(['station', 'valid', 'lat', 'lon', 'elevation'] + valid_features))\
.sort(['valid', 'station'])\
.with_columns([(pl.col('drct').map_elements(lambda x: np.sin(np.radians(x)), return_dtype=pl.Float64)).alias('drct_sin'),
(pl.col('drct').map_elements(lambda x: np.cos(np.radians(x)), return_dtype=pl.Float64)).alias('drct_cos')])\
.drop('drct')
row_count, feature_count = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').shape
valid_station = reduced_asos_df.select(pl.col('station')).head(7).to_series().to_list()
station_count = len(valid_station)
valid_features = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').columns
# shape is time, station, feature
station_matrix = reduced_asos_df.drop('station', 'valid', 'lat', 'lon', 'elevation').to_numpy().reshape(int(row_count/station_count), station_count, feature_count)
reduced_asos_df

| station | valid | lat | lon | elevation | tmpf | dwpf | relh | sknt | feel | drct_sin | drct_cos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | datetime[μs] | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f64 | f64 |
| "GCK" | 2018-01-01 00:00:00 | 37.927502 | -100.724403 | 881.0 | 9.283334 | -6.5 | 48.148335 | 8.666667 | -4.161667 | 0.851117 | 0.524977 |
| "LBL" | 2018-01-01 00:00:00 | 37.044201 | -100.9599 | 879.0 | 12.316667 | -1.983333 | 52.014999 | 10.166667 | -1.721667 | 0.664796 | 0.747025 |
| "EHA" | 2018-01-01 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 15.555555 | 5.255556 | 63.382778 | 7.388889 | 4.482222 | 0.970763 | 0.24004 |
| "HQG" | 2018-01-01 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 14.311111 | -1.605556 | 48.468334 | 7.777778 | 2.681667 | 0.936332 | 0.351115 |
| "3K3" | 2018-01-01 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 13.1 | -0.9 | 53.127777 | 6.777778 | 2.151111 | 0.981255 | 0.192712 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "EHA" | 2020-12-31 00:00:00 | 37.000801 | -101.879997 | 1099.0 | 42.355556 | 16.588888 | 34.955555 | 3.0 | 40.254444 | -0.533205 | -0.845986 |
| "HQG" | 2020-12-31 00:00:00 | 37.163101 | -101.370499 | 956.52002 | 40.722221 | 13.255555 | 32.23111 | 1.666667 | 39.450001 | 0.977334 | -0.211704 |
| "3K3" | 2020-12-31 00:00:00 | 37.991699 | -101.7463 | 1005.700012 | 40.200001 | 12.2 | 31.377777 | 4.555555 | 36.683334 | -0.700217 | -0.71393 |
| "JHN" | 2020-12-31 00:00:00 | 37.578201 | -101.7304 | 1012.710022 | 40.711113 | 17.4 | 38.554443 | 4.777778 | 37.06889 | -0.824675 | -0.565607 |
| "19S" | 2020-12-31 00:00:00 | 37.496899 | -100.832901 | 892.570007 | 39.922222 | 15.388889 | 36.552223 | 6.111111 | 35.113335 | -0.824675 | 0.565607 |
# True means there are no time slices where all values are NaN
not np.any(np.all(np.isnan(station_matrix), axis=(1, 2)))
True
import geopy.distance
# calculate the geodesic distance between two nodes
def compute_node_distance(node1, node2, inverse=False):
coords_1 = (node1[1], node1[0])
coords_2 = (node2[1], node2[0])
# elevation_difference = abs(node1[2] - node2[2])
horizontal_distance = geopy.distance.geodesic(coords_1, coords_2).km
# return math.sqrt(horizontal_distance**2 + (elevation_difference/1000)**2)
if inverse:
try:
horizontal_distance = 1/horizontal_distance
except ZeroDivisionError:
horizontal_distance = 0
    return horizontal_distance
station_df = reduced_asos_df.select(pl.col(['station', 'lon', 'lat', 'elevation'])).unique().to_pandas()
grid_list = station_df.loc[:, ['lon', 'lat']].reset_index()[['lon', 'lat', 'index']].to_numpy().tolist()
grid_list = [sublist[:-1] + [int(sublist[-1])] for sublist in grid_list]
result_dict = {'index': [],
'station': []}
result_dict = {**result_dict, **{str(i): [] for _, _, i in grid_list}}
# compute the inverse geodesic distance from every station to every other station
for row_index, station in station_df.iterrows():
for col_index, station2 in station_df.iterrows():
# To get elevation with the new system I would have to find a heightmap for the new lon/lat coordinates
result_dict[str(col_index)].append(compute_node_distance([station['lon'], station['lat']], [station2['lon'], station2['lat']], inverse=True))
result_dict['station'].append(station['station'])
result_dict['index'].append(row_index)
# generate a dataframe of pairwise inverse distances between station nodes
grid_map_df = pl.DataFrame(result_dict, schema={**{'station': pl.Categorical, 'index': pl.UInt64}, **{str(i): pl.Float64 for _, _, i in grid_list}})
from sklearn.preprocessing import MinMaxScaler, minmax_scale
scaled_idistance = minmax_scale(grid_map_df.drop('station', 'index'))
modified_adjacency_df = grid_map_df.select(pl.col(['station', 'index']))\
.join(pl.DataFrame(scaled_idistance, schema=[str(i) for i, _ in enumerate(scaled_idistance)]).with_row_index(),
on='index')
from shapely.geometry import Polygon, Point, LineString
import geopandas
import contextily as ctx
# Create Point geometries for each station (lon, lat order)
station_node_list = [Point(lon, lat) for lat, lon in station_df[['lat', 'lon']].to_numpy()]
stations_gdf = geopandas.GeoDataFrame(station_df.copy(), geometry=station_node_list, crs="EPSG:4326")
# Create edges as LineStrings between all unique pairs of stations
edge_list = []
n = len(station_node_list)
for i in range(n):
for j in range(i + 1, n):
edge_list.append(LineString([station_node_list[i], station_node_list[j]]))
edges_gdf = geopandas.GeoDataFrame(geometry=edge_list, crs="EPSG:4326")
# Plot everything
fig, ax = plt.subplots(figsize=(18, 10))
# Plot edges
edges_gdf.plot(ax=ax, color="xkcd:blue", linewidth=1, alpha=1)
# Plot nodes
stations_gdf.plot(ax=ax, color="xkcd:bright orange", markersize=10)
# Add station labels
for i, row in stations_gdf.iterrows():
ax.text(row.geometry.x, row.geometry.y, row['station'], fontsize=9)
# Add basemap
ctx.add_basemap(ax, crs=stations_gdf.crs.to_string())
adj = modified_adjacency_df.drop('station', 'index').to_numpy()
adj = adj / adj.sum(axis=1, keepdims=True) # row normalize
def spatial_impute(data, adj):
imputed = data.copy()
T, N, F = data.shape
for t in range(T): # time index
for i in range(N): # station index
for f in range(F): # feature index
if np.isnan(imputed[t, i, f]):
# Get neighbor values and weights
neighbor_vals = imputed[t, :, f]
weights = adj[i]
mask = ~np.isnan(neighbor_vals)
if mask.sum() > 0:
imputed[t, i, f] = np.dot(weights[mask], neighbor_vals[mask]) / weights[mask].sum()
return imputed
def spatiotemporal_impute(data, adj):
data = spatial_impute(data, adj)
# fill remaining NaNs using temporal interpolation
T, N, F = data.shape
for i in tqdm(range(N)):
for f in range(F):
series = data[:, i, f]
mask = ~np.isnan(series)
if mask.sum() == 0:
continue # entire feature missing, can't interpolate
# Linear interpolation
indices = np.arange(T)
data[:, i, f] = np.interp(indices, indices[mask], series[mask])
return data
data_imputed = spatiotemporal_impute(station_matrix, adj)
import pandas as pd
# plot the first 200 time steps of feature 2 for each station, after imputation
pd.DataFrame(data_imputed[:200, :, 2]).plot(legend=False, subplots=True, figsize=(20, 8))
# the same slice before imputation, for comparison
pd.DataFrame(station_matrix[:200, :, 2]).plot(legend=False, subplots=True, figsize=(20, 8))
def node_correlation(data, node_number):
T, N, F = data.shape
# Store correlation matrices per node (features x features)
corr_within_node = np.empty((N, F, F))
for node in range(N):
# Slice data for node: shape (T, F)
node_data = data[:, node, :]
# Compute correlation matrix (features x features)
corr_within_node[node] = np.corrcoef(node_data, rowvar=False)
# build a lower-triangular list of feature-feature correlations for the requested node
corr_list = []
for f_feature in range(F):
inner_list = []
for s_feature in range(F):
if f_feature < s_feature:
inner_list.append(np.nan)
else:
inner_list.append(float(corr_within_node[node_number, f_feature, s_feature]))
corr_list.append(inner_list)
return corr_list
def between_node_correlation(data, feature_number):
feature_data = data[:, :, feature_number]
corr_between_nodes = np.corrcoef(feature_data.T)
corr_between_nodes[np.triu_indices(corr_between_nodes.shape[0], 1)] = np.nan
    return corr_between_nodes
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()
for i in range(data_imputed.shape[1]):
corr_matrix = pd.DataFrame(
node_correlation(data_imputed, i),
columns=valid_features,
index=valid_features
)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=axes[i])
axes[i].set_title(f'{valid_station[i]} Correlation')
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(2, 4, figsize=(20, 10))  # 2 rows, 4 columns grid
axes = axes.flatten()
for i in range(data_imputed.shape[2]):
corr_matrix = pd.DataFrame(
between_node_correlation(data_imputed, i),
columns=valid_station,
index=valid_station
)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=axes[i])
axes[i].set_title(f'Between-Node Correlation: {valid_features[i]}')
plt.tight_layout()
plt.show()
from sklearn.preprocessing import robust_scale
T, N, F = data_imputed.shape
imputed_df = pl.from_numpy(data_imputed.reshape(T * N, F), schema=valid_features)\
.drop('dwpf', 'feel')\
.with_columns(pl.DataFrame([valid_station[i % len(valid_station)] for i in range(T*N)], schema={'station': pl.Categorical})\
.join(pl.from_pandas(station_df),
on='station',
how='left'))\
.select(pl.col(['station', 'lon', 'lat', 'elevation', 'tmpf', 'relh', 'sknt', 'drct_sin', 'drct_cos']))\
.drop('lon', 'lat', 'elevation')\
.with_columns(pl.col('relh')/100)
imputed_df = imputed_df.drop('tmpf', 'sknt')\
.hstack(pl.DataFrame(robust_scale(imputed_df.select(pl.col('tmpf'))), schema=['tmpf']))\
.hstack(pl.DataFrame(robust_scale(imputed_df.select(pl.col('sknt'))), schema=['sknt']))\
.select(pl.col(['station', 'tmpf', 'relh', 'sknt', 'drct_sin', 'drct_cos']))
imputed_df

| station | tmpf | relh | sknt | drct_sin | drct_cos |
|---|---|---|---|---|---|
| cat | f64 | f64 | f64 | f64 | f64 |
| "GCK" | -1.355721 | 0.481483 | -0.080357 | 0.851117 | 0.524977 |
| "LBL" | -1.265174 | 0.52015 | 0.160714 | 0.664796 | 0.747025 |
| "EHA" | -1.168491 | 0.633828 | -0.285714 | 0.970763 | 0.24004 |
| "HQG" | -1.205638 | 0.484683 | -0.223214 | 0.936332 | 0.351115 |
| "3K3" | -1.241791 | 0.531278 | -0.383929 | 0.981255 | 0.192712 |
| … | … | … | … | … | … |
| "EHA" | -0.368491 | 0.349556 | -0.991072 | -0.533205 | -0.845986 |
| "HQG" | -0.417247 | 0.322311 | -1.205357 | 0.977334 | -0.211704 |
| "3K3" | -0.432836 | 0.313778 | -0.741072 | -0.700217 | -0.71393 |
| "JHN" | -0.417579 | 0.385544 | -0.705357 | -0.824675 | -0.565607 |
| "19S" | -0.441128 | 0.365522 | -0.491072 | -0.824675 | 0.565607 |
# if the station is going to be used for linear regression, apply a one-hot encoding to it (see the sketch below)
imputed_pd_df = imputed_df.to_pandas()
imputed_pl_df = imputed_df
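As a minimal sketch of that one-hot step (not actually performed above; `encoded_pd_df` is a hypothetical name), pandas can expand the categorical station column into indicator features for the pooled linear model:

```python
import pandas as pd

# Hypothetical: expand 'station' into 0/1 indicator columns so a single
# pooled linear model can include per-station effects.
encoded_pd_df = pd.get_dummies(imputed_pd_df, columns=['station'], prefix='station')
```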
Explain your data preprocessing and cleaning steps.
Present your key findings in a clear and concise manner.
Use visuals to support your claims.
Tell a story about what the data reveals.
Summarize your key findings.
Discuss the implications of your results.